---
title: "Python 3 Unicode Strings and Encodings"
date: 2020-9-30
categories:
- python
tags:
---

<div id="content">
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#org21dc399">Unicode String</a></li>
<li><a href="#org6323745">Byte String</a></li>
</ul>
</div>
</div>
<blockquote>
<p>
<a href="https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-strings/">https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-strings/</a>
</p>
</blockquote>
<div class="outline-2" id="outline-container-org21dc399">
<h2 id="org21dc399">Unicode String</h2>
<div class="outline-text-2" id="text-org21dc399">
<p>
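In Python 3, every string is a Unicode string. A character can be written directly with the escape sequence {% raw %} '\uXXXX' {% endraw %} (or {% raw %} '\UXXXXXXXX' {% endraw %} for code points above U+FFFF), for example: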
</p>
<div class="org-src-container">
<pre class="src src-python">&gt;&gt;&gt; <span style="font-weight: bold; font-style: italic;">euro</span> = <span style="font-style: italic;">'\u20AC'</span> <span style="font-weight: bold; font-style: italic;"># </span><span style="font-weight: bold; font-style: italic;">\u takes 4 hex digits</span>
&gt;&gt;&gt; euro
<span style="font-style: italic;">'€'</span>
&gt;&gt;&gt; <span style="font-weight: bold; font-style: italic;">smile</span> = <span style="font-style: italic;">'\U0001F642'</span> <span style="font-weight: bold; font-style: italic;"># </span><span style="font-weight: bold; font-style: italic;">\U takes 8 hex digits</span>
&gt;&gt;&gt; smile
<span style="font-style: italic;">'🙂'</span>
</pre>
</div>
<p>
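The Python standard library module <a href="https://docs.python.org/3/library/unicodedata.html?#module-unicodedata">unicodedata</a> provides functions for querying information about Unicode characters. The character properties themselves are specified in the <a href="https://www.unicode.org/reports/tr44/">Unicode Character Database</a>.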
</p>
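<p>
As a minimal sketch of what unicodedata offers, the snippet below looks up the name and general category of the two characters from the example above, then maps a name back to its character. The loop and variable names are illustrative only.
</p>
<div class="org-src-container">
<pre class="src src-python">import unicodedata

# Illustrative sample characters: the euro sign and the smiling face
for ch in ['\u20AC', '\U0001F642']:
    name = unicodedata.name(ch)          # official character name
    category = unicodedata.category(ch)  # general category code
    print(f'U+{ord(ch):04X} {name} {category}')
# U+20AC EURO SIGN Sc
# U+1F642 SLIGHTLY SMILING FACE So

# lookup() goes the other way: from a name to the character
print(unicodedata.lookup('EURO SIGN'))   # €
</pre>
</div>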
</div>
</div>
<div class="outline-2" id="outline-container-org6323745">
<h2 id="org6323745">Byte String</h2>
<div class="outline-text-2" id="text-org6323745">
<p>
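Data sent over the network or stored in a file, however, is handled in Python 3 as a byte string. A byte string literal is written with a b prefix, as in {% raw %} data = b'asdf' {% endraw %}.
The bytes that represent a given piece of text depend on the encoding, for example UTF-8 or GBK.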
</p>
<div class="org-src-container">
<pre class="src src-python">&gt;&gt;&gt; <span style="font-weight: bold; font-style: italic;">text</span> = <span style="font-style: italic;">'中文ASDF'</span> <span style="font-weight: bold; font-style: italic;"># </span><span style="font-weight: bold; font-style: italic;">Unicode string</span>
&gt;&gt;&gt; text.encode() <span style="font-weight: bold; font-style: italic;"># </span><span style="font-weight: bold; font-style: italic;">encoding defaults to UTF-8</span>
b<span style="font-style: italic;">'\xe4\xb8\xad\xe6\x96\x87ASDF'</span>
&gt;&gt;&gt; text.encode(<span style="font-style: italic;">'GBK'</span>)
b<span style="font-style: italic;">'\xd6\xd0\xce\xc4ASDF'</span>
</pre>
</div>
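<p>
To make the dependence on the encoding concrete, the following sketch compares the length of the same text as a Unicode string and as UTF-8 and GBK byte strings. The variable names are chosen for illustration.
</p>
<div class="org-src-container">
<pre class="src src-python">text = '中文ASDF'                  # 6 characters (code points)
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

print(len(text))        # 6
print(len(utf8_bytes))  # 10: 3 bytes per Chinese character + 4 ASCII bytes
print(len(gbk_bytes))   # 8: 2 bytes per Chinese character + 4 ASCII bytes

# Indexing a byte string yields integers, not one-character strings
print(utf8_bytes[0])    # 228, i.e. 0xe4
print(text[0])          # 中
</pre>
</div>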
<p>
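A Unicode string and a byte string are different types and never compare equal, even when they hold the same ASCII text: {% raw %} 'Hello world' == b'Hello world' {% endraw %} returns False.
A byte string provides a decode method that converts it back to a Unicode string.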
</p>
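<p>
A short sketch of the comparison and of decode, including what happens when the wrong codec is used. The sample values are illustrative.
</p>
<div class="org-src-container">
<pre class="src src-python">text = 'Hello world'
data = b'Hello world'

print(text == data)           # False: str and bytes never compare equal
print(data.decode() == text)  # True: decode() defaults to UTF-8

# Bytes must be decoded with the codec that produced them
gbk_bytes = '中文'.encode('gbk')
try:
    gbk_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    print('cannot decode as UTF-8:', exc.reason)
print(gbk_bytes.decode('gbk'))  # 中文
</pre>
</div>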
<p>
In summary:<br />
Unicode string, encode -&gt; byte string<br />
byte string, decode -&gt; Unicode string
</p>
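<p>
The same round trip applies to file I/O. The sketch below writes GBK bytes to a scratch file and reads them back, once as raw bytes and once through text mode, which decodes automatically. The temporary file is a hypothetical stand-in for any real file.
</p>
<div class="org-src-container">
<pre class="src src-python">import os
import tempfile

text = '中文ASDF'

# Hypothetical scratch file, created only for this demonstration
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, 'wb') as f:            # binary mode accepts bytes only
        f.write(text.encode('gbk'))
    with open(path, 'rb') as f:            # binary mode returns bytes
        raw = f.read()
    with open(path, encoding='gbk') as f:  # text mode decodes for us
        decoded = f.read()
finally:
    os.remove(path)

print(raw)                        # b'\xd6\xd0\xce\xc4ASDF'
print(raw.decode('gbk') == text)  # True
print(decoded == text)            # True
</pre>
</div>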
</div>
</div>
</div>
<div class="status" id="postamble">
<p class="date">Date: 2020-9-30</p>
<p class="author">Author: gdme1320</p>
</div>
